data artifact
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness--the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies--remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.1
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness--the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies--remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Gu, Ken, Zhang, Zhihan, Lin, Kate, Zhang, Yuwei, Paruchuri, Akshay, Yu, Hong, Kazemi, Mehran, Ayush, Kumar, Heydari, A. Ali, Xu, Maxwell A., Narayanswamy, Girish, Liu, Yun, Poh, Ming-Zher, Yang, Yuzhe, Malhotra, Mark, Patel, Shwetak, Palangi, Hamid, Xu, Xuhai, McDuff, Daniel, Althoff, Tim, Liu, Xin
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
No More Distractions: an Adaptive Up-Sampling Algorithm to Reduce Data Artifacts
Researchers recently found out that sometimes language models achieve high accuracy on benchmark data set, but they can not generalize very well with even little changes to the original data set. This is sometimes due to data artifacts, model is learning the spurious correlation between tokens and labels, instead of the semantics and logic. In this work, we analyzed SNLI data and visualized such spurious correlations. We proposed an adaptive up-sampling algorithm to correct the data artifacts, which is simple and effective, and does not need human edits or annotation. We did an experiment applying the algorithm to fix the data artifacts in SNLI data and the model trained with corrected data performed significantly better than the model trained with raw SNLI data, overall, as well as on the subset we corrected.
Probabilistic detection of short events, with application to critical care monitoring
We describe an application of probabilistic modeling and inference technology to the problem of analyzing sensor data in the setting of an intensive care unit (ICU). In particular, we consider the arterial-line blood pressure sensor, which is subject to frequent data artifacts that cause false alarms in the ICU and make the raw data almost useless for automated decision making. The problem is complicated by the fact that the sensor data are acquired at fixed intervals whereas the events causing data artifacts may occur at any time and have durations that may be significantly shorter than the data collection interval. We show that careful modeling of the sensor, combined with a general technique for detecting sub-interval events and estimating their duration, enables effective detection of artifacts and accurate estimation of the underlying blood pressure values.
PADL: portable PyTorch pipelines facilitating deep-learning model use
Programs are read more often than they are written. Models are used more often than they are trained. The PyTorch, and the deep-learning ecosystem in general, abounds with tools for training models, and squeezing the best performance out of computational resources in doing this. In the life cycle of a model this is only the beginning of the journey. Once a model has been trained, it will be shared, and used in a multitude of contexts, often on a daily basis, in operations, evaluation, comparision and experimentation by data scientists.
Launch of the SandLabs Project
The SandLabs Team currently consists of Wyatt Walsh and Ryan Epprecht. Having met in high school, this dynamic duo has a rich history together and each member brings a rich set of experiences and skills to the team. Navigate to their various profiles if you are interested in learning more about Wyatt or Ryan. SandLabs aims to explore the blockchain domain via a data scientific lens to generate new insights and make helpful contributions to the BlockchainxData communities and beyond. The initial focus of our work will be data collection, extraction, and processing high-quality data for future use.
Starting your data science project with Metaflow? The MNIST use-case.
The priority of data scientists simply lies in picking out the right features, building and deploying their models, they do not like to be particularly bothered about other aspects like model versioning, job scheduling, flow architecture, compute resources management, which is needed to make operationalizing data science successful. Metaflow is an open-source tool by Netflix for managing data science workflows. It aims to boost the productivity of data scientists by allowing them to focus on actual data science work and by facilitating faster productionization of their deliverables. If you are familiar with Airflow or Luigi then you would understand the function of Metaflow. It allows you to run your data science process in steps, so each step is a node in the process and the nodes are connected like a graph as seen below.
Probabilistic detection of short events, with application to critical care monitoring
Aleks, Norm, Russell, Stuart J., Madden, Michael G., Morabito, Diane, Staudenmayer, Kristan, Cohen, Mitchell, Manley, Geoffrey T.
We describe an application of probabilistic modeling and inference technology to the problem of analyzing sensor data in the setting of an intensive care unit (ICU). In particular, we consider the arterial-line blood pressure sensor, which is subject to frequent data artifacts that cause false alarms in the ICU and make the raw data almost useless for automated decision making. The problem is complicated by the fact that the sensor data are acquired at fixed intervals whereas the events causing data artifacts may occur at any time and have durations that may be significantly shorter than the data collection interval. We show that careful modeling of the sensor, combined with a general technique for detecting sub-interval events and estimating their duration, enables effective detection of artifacts and accurate estimation of the underlying blood pressure values. Papers published at the Neural Information Processing Systems Conference.